{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "H_D9kG_efts3"
   },
   "source": [
    "# Transformers 模型量化技术：AWQ"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WE9IhcVyktah"
   },
   "source": [
    "![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "Wwsg6nCwoThm"
   },
   "source": [
    "在2023年6月，Ji Lin等人发表了论文[AWQ：Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf)。\n",
    "\n",
    "这篇论文详细介绍了一种激活感知权重量化算法，可以用于压缩任何基于 Transformer 的语言模型，同时只有微小的性能下降。关于 AWQ 算法的详细介绍，见[MIT Han Song 教授分享](https://hanlab.mit.edu/projects/awq)。\n",
    "\n",
    "transformers 现在支持两个不同的 AWQ 开源实现库：\n",
    "\n",
    "- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)\n",
    "- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-H2019RkoiM-"
   },
   "source": [
    "因为 LLM-AWQ 不支持 Nvidia T4 GPU（课程演示 GPU），所以我们使用 AutoAWQ 库来介绍和演示 AWQ 模型量化技术。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 量化前模型测试文本生成任务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import pipeline\n",
    "\n",
    "model_path = \"facebook/opt-125m\"\n",
    "\n",
    "# 使用 GPU 加载原始的 OPT-125m 模型\n",
    "generator = pipeline('text-generation',\n",
    "                     model=model_path,\n",
    "                     device=0,\n",
    "                     do_sample=True,\n",
    "                     num_return_sequences=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 实测GPU显存占用：加载 OPT-125m 模型后\n",
    "\n",
    "```shell\n",
    "Sun Dec 24 15:11:33 2023\n",
    "+---------------------------------------------------------------------------------------+\n",
    "| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |\n",
    "|-----------------------------------------+----------------------+----------------------+\n",
    "| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
    "| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n",
    "|                                         |                      |               MIG M. |\n",
    "|=========================================+======================+======================|\n",
    "|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |\n",
    "| N/A   47C    P0              26W /  70W |    635MiB / 15360MiB |      0%      Default |\n",
    "|                                         |                      |                  N/A |\n",
    "+-----------------------------------------+----------------------+----------------------+\n",
    "\n",
    "+---------------------------------------------------------------------------------------+\n",
    "| Processes:                                                                            |\n",
    "|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |\n",
    "|        ID   ID                                                             Usage      |\n",
    "|=======================================================================================|\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'The woman worked as a tour guide, and was also a teacher in her own right. In her'},\n",
       " {'generated_text': 'The woman worked as a sales manager at a grocery store so all she needed was someone to sit next'},\n",
       " {'generated_text': 'The woman worked as a chef for several years before deciding to retire in 2010. She was also the'}]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(\"The woman worked as a\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'The man worked as a \"bait\" during a career that I think had an emphasis on fishing'},\n",
       " {'generated_text': 'The man worked as a construction worker in California for a couple years before he moved into real estate in'},\n",
       " {'generated_text': \"The man worked as a cashier, and he's probably never heard of the place where you're\"}]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generator(\"The man worked as a\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6dJJRQ2p7eLQ"
   },
   "source": [
    "## 使用 AutoAWQ 量化模型\n",
    "\n",
    "下面我们以 `facebook opt-125m` 模型为例，使用 `AutoAWQ` 库实现的 AWQ 算法实现模型量化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "486f74cb2a7b4d11bcd5fd70ed8277e9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from awq import AutoAWQForCausalLM\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "\n",
    "quant_path = \"models/opt-125m-awq\"\n",
    "quant_config = {\"zero_point\": True, \"q_group_size\": 128, \"w_bit\": 4, \"version\": \"GEMM\"}\n",
    "\n",
    "# 加载模型\n",
    "model = AutoAWQForCausalLM.from_pretrained(model_path, device_map=\"cuda\")\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "id": "Qn_P_E5p7gAN"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.\n",
      "  warnings.warn(\"Repo card metadata block was not found. Setting CardData to empty.\")\n",
      "AWQ: 100%|██████████| 12/12 [01:20<00:00,  6.71s/it]\n"
     ]
    }
   ],
   "source": [
    "# 量化模型\n",
    "model.quantize(tokenizer, quant_config=quant_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 实测GPU显存使用：量化模型时峰值达到将近 4GB\n",
    "\n",
    "```shell\n",
    "Sun Dec 24 15:12:50 2023\n",
    "+---------------------------------------------------------------------------------------+\n",
    "| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |\n",
    "|-----------------------------------------+----------------------+----------------------+\n",
    "| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
    "| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n",
    "|                                         |                      |               MIG M. |\n",
    "|=========================================+======================+======================|\n",
    "|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |\n",
    "| N/A   48C    P0              32W /  70W |    3703MiB / 15360MiB |      2%      Default |\n",
    "|                                         |                      |                  N/A |\n",
    "+-----------------------------------------+----------------------+----------------------+\n",
    "\n",
    "+---------------------------------------------------------------------------------------+\n",
    "| Processes:                                                                            |\n",
    "|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |\n",
    "|        ID   ID                                                             Usage      |\n",
    "|=======================================================================================|\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "id": "nVzKDBlP_6MV"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quant_config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PuPLq9sa8EaN"
   },
   "source": [
    "#### Transformers 兼容性配置\n",
    "\n",
    "为了使`quant_config` 与 transformers 兼容，我们需要修改配置文件：`使用 Transformers.AwqConfig 来实例化量化模型配置`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "KE8xjwlL8DnA"
   },
   "outputs": [],
   "source": [
    "from transformers import AwqConfig, AutoConfig\n",
    "\n",
    "# 修改配置文件以使其与transformers集成兼容\n",
    "quantization_config = AwqConfig(\n",
    "    bits=quant_config[\"w_bit\"],\n",
    "    group_size=quant_config[\"q_group_size\"],\n",
    "    zero_point=quant_config[\"zero_point\"],\n",
    "    version=quant_config[\"version\"].lower(),\n",
    ").to_dict()\n",
    "\n",
    "# 预训练的transformers模型存储在model属性中，我们需要传递一个字典\n",
    "model.model.config.quantization_config = quantization_config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('models/opt-125m-awq/tokenizer_config.json',\n",
       " 'models/opt-125m-awq/special_tokens_map.json',\n",
       " 'models/opt-125m-awq/vocab.json',\n",
       " 'models/opt-125m-awq/merges.txt',\n",
       " 'models/opt-125m-awq/added_tokens.json',\n",
       " 'models/opt-125m-awq/tokenizer.json')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 保存模型权重\n",
    "model.save_quantized(quant_path)\n",
    "tokenizer.save_pretrained(quant_path)  # 保存分词器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用 GPU 加载量化模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(quant_path)\n",
    "model = AutoModelForCausalLM.from_pretrained(quant_path, device_map=\"cuda\").to(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_text(text):\n",
    "    inputs = tokenizer(text, return_tensors=\"pt\").to(0)\n",
    "\n",
    "    out = model.generate(**inputs, max_new_tokens=64)\n",
    "    return tokenizer.decode(out[0], skip_special_tokens=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Merry Christmas! I'm glad to be the son of the son of the son of the son of the son of the son of the son of the son of be of the son of the son of the son of the son of the son of the son of the son of the son of the son of the son of the son of the son of the\n"
     ]
    }
   ],
   "source": [
    "result = generate_text(\"Merry Christmas! I'm glad to\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "id": "Z0hAXYanCDW3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The woman worked as a teacher at the school told me this, she will only told me something this.  She's \" told her that she's not a child has not a child  he said that she * is a child * has not a child\n"
     ]
    }
   ],
   "source": [
    "result = generate_text(\"The woman worked as a\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homework：使用 AWQ 算法量化 Facebook OPT-2.7B 模型\n",
    "\n",
    "Facebook OPT 模型：https://huggingface.co/facebook?search_models=opt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
